922 research outputs found

    Semantic Embedding Space for Zero-Shot Action Recognition

    The number of categories for action recognition is growing rapidly. It is thus becoming increasingly hard to collect sufficient training data to learn conventional models for each category. This issue may be ameliorated by the increasingly popular 'zero-shot learning' (ZSL) paradigm. In this framework a mapping is constructed between visual features and a human-interpretable semantic description of each category, allowing categories to be recognised in the absence of any training data. Existing ZSL studies focus primarily on image data and attribute-based semantic representations. In this paper, we address zero-shot recognition in contemporary video action recognition tasks, using a semantic word vector space as the common space in which to embed videos and category labels. This is more challenging because the mapping between the semantic space and the space-time features of videos containing complex actions is more complex and harder to learn. We demonstrate that a simple self-training and data augmentation strategy can significantly improve the efficacy of this mapping. Experiments on human action datasets including HMDB51 and UCF101 demonstrate that our approach achieves state-of-the-art zero-shot action recognition performance. Comment: 5 pages
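The recognition step described in this abstract can be illustrated with a minimal sketch. It assumes a video has already been mapped into the word vector space and assigns the unseen category whose label vector is closest by cosine similarity. The function names, the three-dimensional toy vectors, and the use of cosine similarity are illustrative assumptions, not details taken from the paper; the self-training and data augmentation steps are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(projected_video, class_vectors):
    """Return the category whose label vector is most similar to the video.

    projected_video: the video after mapping into the semantic space
                     (the learned mapping itself is assumed, not shown).
    class_vectors:   dict of category name -> word vector for that label.
    """
    return max(class_vectors,
               key=lambda name: cosine(projected_video, class_vectors[name]))


# Hypothetical word vectors for categories with no training videos.
unseen_classes = {
    "run": [0.9, 0.1, 0.0],
    "swim": [0.0, 0.9, 0.2],
    "climb": [0.1, 0.0, 0.9],
}

# Hypothetical output of the visual-to-semantic mapping for one test video.
video_embedding = [0.1, 0.8, 0.3]

prediction = zero_shot_predict(video_embedding, unseen_classes)
print(prediction)
```

Because the categories are represented only by their word vectors, new categories can be added to `unseen_classes` without collecting any videos for them, which is the property the ZSL paradigm relies on.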

    Chinese loans in Old Vietnamese with a sesquisyllabic phonology

    While consonant clusters, taken broadly to include presyllables, are commonly hypothesized for Old Chinese, little direct evidence is available for establishing the early forms of specific words. This essay examines a hitherto overlooked source: Old Vietnamese, a language substantially attested in a single document, which writes certain words that are monosyllabic in modern Vietnamese in an orthography suggesting sesquisyllabic phonology. For a number of words loaned from Chinese, Old Vietnamese provides the only testimony of the form of the Vietic borrowing. The small list of currently known sesquisyllabic words of Chinese origin attested in this document includes examples of both words with a secure initial Chinese cluster and words with plausible Vietic-internal prefixation.

    Positional Label for Self-Supervised Vision Transformer

    Positional encoding is important for the vision transformer (ViT) to capture the spatial structure of the input image, and its general effectiveness in ViT has been proven. In our work we propose to train ViT to recognize the positional labels of patches of the input image; this apparently simple task actually yields a meaningful self-supervisory task. Based on previous work on ViT positional encoding, we propose two positional labels dedicated to 2D images: absolute position and relative position. Our positional labels can be easily plugged into various current ViT variants. They can work in two ways: (a) as an auxiliary training target for vanilla ViT (e.g., ViT-B and Swin-B) for better performance; (b) combined with self-supervised ViT (e.g., MAE) to provide a more powerful self-supervised signal for semantic feature learning. Experiments demonstrate that with the proposed self-supervised methods, ViT-B and Swin-B gain improvements of 1.20% (top-1 Acc) and 0.74% (top-1 Acc) on ImageNet, respectively, and 6.15% and 1.14% on Mini-ImageNet.
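One way to picture an absolute positional label is as the row-major index of each patch in the patch grid, which a classification head could then be trained to predict. The sketch below is an assumption made for illustration: the abstract does not define the labels, so the function name, the row-major numbering, and the 224/16 sizes are hypothetical, and the relative-position label is not shown.

```python
def absolute_position_labels(image_size, patch_size):
    """Return a grid of integer labels, one per patch, in row-major order.

    Assumes a square image split into non-overlapping square patches.
    This numbering is an illustrative choice, not the paper's definition.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size
    return [[row * grid + col for col in range(grid)] for row in range(grid)]


# A 224x224 image with 16x16 patches gives a 14x14 grid of labels.
labels = absolute_position_labels(224, 16)
num_classes = len(labels) * len(labels[0])
print(num_classes, labels[0][0], labels[-1][-1])
```

Under this assumption each patch position is one class, so the auxiliary target is a classification over `num_classes` positions.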